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PQ ■ Abstract 



We propose a new and computationally efficient algorithm for maximizing the 

observed log-likelihood for a multivariate normal data matrix with missing values. 

We show that our procedure based on iteratively regressing the missing on the ob- 
'^ , • served variables, generalizes the traditional EM algorithm by alternating between 

different complete data spaces and performing the E-Step incrementally. In this non- 
O^l ' standard setup we prove numerical convergence to a stationary point of the observed 

log-likelihood. 

For high-dimensional data, where the number of variables may greatly exceed 
f^ , sample size, we add a Lasso penalty in the regression part of our algorithm and perform 

f^ ' coordinate descent approximations. This leads to a computationally very attractive 

technique with sparse regression coefficients for missing data imputation. Simulations 
f*"*) ' and results on four microarray datasets show that the new method often outperforms 

f^ , alternative imputation techniques as k-nearest neighbors imputation, nuclear norm 

minimization or a penalized likelihood approach with an £i-penalty on the inverse 

covariance matrix. 
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1 Introduction and Motivation 



Missing data imputation for large datasets is an essential pre-processing step in com- 
plex data applications. A well-known example are microarray datasets which contain the 
expression values of thousands of genes from a series of experiments. Missing values are in- 
herent to such datasets. They occur for diverse reasons, e.g. insufficient resolution, image 
corruption, dust or scratches on the slides. Apart from microarrays, much attention has 
been recently given to the so-called matrix completion problem. Most prominent in this 
context is the "Netflix" movie rating dataset with rows corresponding to customers and 
columns representing their movie rating and the customers have seen/rated only a small 
fraction of all possible movies. The goal is to estimate the ratings for unseen movies. In 
this paper, we propose a new and computationally efficient EM-type algorithm for missing 
value imputation in the high-dimensional multivariate normal model where the number of 
variables p can greatly exceed the number of independent samples n. While our motivating 
examples are microarray datasets with expression values of p different genes (variables) 
and n different experimental conditions (samples), our method is applicable to a wide 
range of problems where the data is arranged in the form of a nxp matrix with a possibly 
large fraction of the entries missing. The Gaussian assumption in our model is used for 
computation of the likelihood but empirical findings suggest that the method is useful for 
many continuous data matrices. 

There is a growing literature of missing value imputation methods. For microarray data, 
examples are k-nearest ne ighbors imputation and singular value decomposition imputation 
( Trovanskava et all l200ll ). imputation based on Bayesian prir icipal component analysis 



(lOba et ail 12003 ) or the local least squares imputation from Kim et ali (l2pOG) . For a 



pone 
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review and a discussion on microarray data imputation see lAittokallid ( 2010 ). In the 
context of the so-called matrix completion problem, where the goal is to recover a low-rank 
matrix from an incomplete set of entries, it has been shown in a series of fascinating papers 
that one can recover the missing data entries by solving a conv ex optimization problem , 
namely, nuclear- r iorm niinimization subject t o data constraints ( Candes and Rechtl . 120091 : 
Candes and Tad . l2009l : iKeshavan et ad. 12009). E ff icient conv ex algorithms for t he ma trix 
completion problem were proposed bv ICai et ali ( 20081 ) and iMazumder et ali ( 20091 ) . If 
the missing data problem does not arise from a near low rank matrix scenario, there is 
substantial room to improve upon the convex matrix completion algorithms. We will 
empirically demonstrate this point for some microarray high-throughput biological data. 



Here , we address the missing data problem through a likelihood approach (JLittle and Rubin 
19871 : ISchafed . 119971 ). We model the correlation between different variables in the data by 
using the Multivariate Normal Model (MVN) AA(/i, S) with a p- dimensional covariance 



matrix S. Recently, in the high-dimensional setup with p ^ n, IStadler and Biihlmann 



( 20ld ) proposed to maximize the penalized observed log-likelihood with an ^i-penalty on 
the inverse cov ariance matrix. They called their method the MissGLasso, as an extension 
of the GLasso ( Friedman et a/.l . 120071 ) for mis sing data. A similar approac h, in the context 
of so-called transposable models, is given bv I Allen and Tibshiranil ( 20ld ). 



The MissGLasso uses an EM algorithm for optimization of the penalized observed log- 
likelihood. Roughly, the algorithm can be summarized as follows. In the E-Step, for 



each sample, the regression coefficients of the missing against the observed variables are 
computed from the current estimated covariance matrix S. In the following M-Step, the 
missing values are imputed by linear regressions and S is re-estimated by performing a 
GLasso on completed data. There are two main drawbacks of this algorithm in a high- 
dimensional context. First, the E-Step is rather complex as it involves (for each sample) 
inversion and multiplication of large matrices in order to compute the regression coeffi- 
cients. Secondly, a sparse inverse covariance does not imply sparse regression coefficients 
while we believe that in high-dimensions, sparse regression coefficients would enhance 
imputations. 

Our new algorithm, MissPALasso (Missingness Pattern Alternating Lasso algorithm) in 
this paper, generalizes the E-Step in order to resolve the disadvantages of the MissGLasso. 
In particular, inversion of a matrix (in order to compute the regression coefficients) will 
be replaced by a simple soft-thresholding operator. In addition, the regression coefficients 
will be sparse, which leads to a new sparsity concept for missing data estimation. 

In order to motivate the MissPALasso, we develop first the Missingness Pattern Alter- 
nating maximization algorithm (MissPA) for optimizing the (unpenalized) observed log- 
likelihood. The MissPA geiieralize s the E- and M-Step of the traditional EM originally 



proposed by iDempster et al.\ (119771 ) by alternating between different complete data spaces 
and performing the E-Step incrementally. Such a generalization does not fit into any of the 
existing methodologies which extend the standard EM. However, by exploiting the spe- 
cial structure of our procedure and applying standard properties of the Kullback-Leibler 
divergence, we prove convergence to a stationary point of the observed log- likelihood. 

The further organization of the paper is as follows: Section 2 introduces the setup and 
the useful notion of missingness patterns. In Section 3 we develop our algorithms: the 
pattern alternating maximization algorithm (MissPA) is presented in Section [3.21 and the 
MissPALasso then follows in Section [3.31 as an adapted version for high-dimensional data 
with p ^ n. Section 4 compares performance of MissPALasso with other imputation 
methods on simulated and real data and reports on computational efficiency. Finally, in 
Section 5, we present some mathematical theory which describes the numerical properties 
of the pattern alternating maximization algorithm. 



2 Setup 

We assume X = {Xi, . . . ,Xp) ~ A/'(/U, S) has a p-variate normal distribution with mean 
fi and covariance S. In order to simplify the notation we set without loss of generality 
/U = 0: for /i ^ 0, some of the formulae involve the parameter n and an intercept column 
of (1, . . . , 1) in the design matrices but conceptually, we can proceed as for the case with 
f^ = 0. We then write X = (Xobsi^mis), where X represents an i.i.d. random sample of 
size n, Xobs denotes the set of observed values, and Xmis the missing data. 

Missingness patterns and different parametrizations For our purpose i t will be 



convenient to group rows of the matrix X according to their missingness patterns (jSchafer , 



19971 ). We index the unique missingness patterns that actuahy appear in our data by 



k = 1, . . . ,s. Furthermore, with Ok C {1, . . . ,p} and nik = {1, . . . ,p} \ Ofc we denote the 
set of observed variables and the set of missing variables, respectively. Ik is the index set 
of the samples (row numbers) which belong to pattern k, whereas I^ = {1, . . . ,n} \Ik 
stands for the row numbers which do not belong to that pattern. By convention, samples 
with all variables observed do not belong to a missingness pattern. 

Consider a partition X = {Xoj.,Xra,.) of a single Gaussian random vector. It's well known 
that Xmf,\Xo^ follows a linear regression on Xq^. with regression coefficients -B^^i^^. and 
covariance '^^nikW given by 

^nH:\o^. = ^mk,Ok^Ok ' (^-1) 

V I — T — y T^^T 

Consequently, we can write the density p(x; S) of X as 

i.e., the density can be characterized by either the parameter S or (Soj.,i?„^|o^, S„^|o^). 
With the transformation (12. ID we can switch between both parametrizations. 

Observed log-likelihood and Maximum Likelihood Estimation (MLE) A sys- 
tematic approach to estimate the parameter of interest S from Xobs maximizes the ob- 
served log-likelihood ^(S;Xobs) given by 

s 

^(S;Xobs)= Y. logp(x^;S) + ^^logp(x,,o,;S,J. (2.2) 

Inference for S can be based on the observed log-likelihood (j2.2p if the underlying missing 
data mechanism is ignorable. The missing data mechanism is said to be ignorable if the 
probability that an observation is missing may depend on Xobs but not on Xmjs (Missing 
at Random) and if the parameters of the data model and the paramete r s of th e missingness 



mechanism are distinct. For a precise definition see lLittle and RubinI (jl987l ) 



Explicit maximization of £(S; Xobs) is only possible for special missing d ata patterns. Most 



prorainent are exam ples with a so-called monotone missing data pattern (JLittle and RubinI . 



19871 : ISchaferl . 119971 ) , where Xi is more observed than X2 , which is more observed than X3 , 
and so on. In this case, the observed log-likelihood factorizes and explicit maximization 
is achieved by performing several regressions. For a general pattern o f missing data , 



the traditional EM algorithm is often used for optimization of ()2.2p . See ISchafeii ( 19971 ) 



for a detailed description of the algorithm. In the next section we present an alternative 
method for maximizing the observed log- likelihood. We will argue that this new algorithm 
is computationally more efficient than the traditional EM. 



3 Pattern Alternating Missing Data Estimation and £i- reg- 
ular izat ion 

For each missingness pattern, indexed by A; = 1, . . . , s, we introduce some further notation: 

X.'^ = (xij) with ielk j = l,...,p 
X-~'' = {xij) with iel^ j = l,...,p. 

Thus, X'^ is the \Ik\ ^P submatrix of X with rows belonging to the A;th pattern. Similarly, 
X~'^ is the |I^| X p matrix with rows not belonging to the /cth pattern. In the same way 
we define Xq^,X^^,X~^ and X~'^. For example, X^^ is defined as the \lk\ x \ok\ matrix 
with 

^ofc = {xi,j) with i G Xfc, j G Ofc. 

3.1 MLE for Data with a Single Missingness Pattern 

Assume that the data matrix X has only one single missingness pattern, denoted by s. 
This is the most simple example of a monotone pattern. The observed log-likelihood 
factorizes according to: 



£(E;Xobs) = X]^°SP(a;i,o,;SoJ + ^logp(xi;S) 

n 
= Yl ^°S Pixi,Os ; So J + ^ log piXi^ms \Xi,Os ; Bm,\os > ^m, |o, ) • (3-3) 



«=i ieXi 



The left and right part in Equation (j3.3p can be maximized separately. The first part 
is maximized by the sample covariance of the observed variables based on all samples, 
whereas the second part is maximized by a regression of the missing against observed 
variables based on only the fully observed samples. In formulae: 



So, =*Xo,Xo7n, (3.4) 

and 

r> t-v-—s-v-—s{t-v-—s-v'—s\ — l 

^ms\os = O^ml ~ -^o/ Bms\os)0^ms ~ "^o/ -^m^loj/l^^s l- (3-5) 

Having these estimates at hand, it is easy to impute the missing data: 



Xi,m, = Bm,\os Xi,o, for ah i £ls, or, in matrix notation, X^, = X^, ^m,|o,- 

It is important to note, that, if interested in imputation, only the regression part of the 
MLE is needed and the estimate Sqs in (|3.4p is superfluous. 



3.2 MLE for General Missing Data Pattern 

We turn now to the general case, where we have more than one missingness pattern, 
indexed by k = 1, . . . , s. The general idea of the new algorithm is as follows. Assume 
we have some initial imputations for all missing values. Our goal is to improve on these 
imputations. For this purpose, we iterate as follows: 

• Keep all imputations except those of the 1st missingness pattern fixed and compute 
the single pattern MLE (for the first pattern) as explained in Section [3.11 In partic- 
ular, compute the regression coefficients of the missing 1st pattern against all other 
variables (treated as "observed"). 

• Use the resulting estimates (regression coefficients) to impute the missing values 
from only the 1st pattern. 

Next, turn to the 2nd pattern and repeat the above steps. In this way we continue cycling 
through the different patterns until convergence. 

We now describe the Pattern Alternating Maximization algorithm which makes the above 
idea precise. Let T = *XX be the sufficient statistic in the multivariate normal model. 
Furthermore, denote by T^ = *(X'=)X'=, T-^ = *(X-'=)X-'= = Y.i^k'^^ ■ Let T and 
T^ {k = 1, . . . , s) be some initial guess of T and T^ [k = 1, . . . , s), for example, using 
zero imputation. Our algorithm proceeds as follows. 

Missingness Pattern Alternating Maximization Algorithm (MissPA) 

(1) r, T^: initial guess of T and T'^' (A; = 1, . . . , s). 

(2) For A; = l,...,s do: 

M-Step: Compute the MLE B^^^^^, and Sm,^|o^, based on T^^ = T — T^: 

Partial E-Step: 

Setr^' = E[r^|x^^,^„,i,,,s„,i„j, 

Update T = T-^ + T^. 

(3) Repeat step (2) until some convergence criterion is met. 

(4) Compute the final maximum likelihood estimator S via: 

Note, that we refer to the maximization step as M-Step and to the imputation step as 
partial E-Step. The word partial refers to the fact that the expectation is only performed 
with respect to samples belonging to the current pattern. The partial E-Step of our 
algorithm takes the following simple form: 

q-k Vx'^ 'iX'^ 

'-j-'k V'V'^ \^^ -\- It IV 



where X^^ - E[X^JX^^, 5^^|o^, S^^i^J - X^^*S^^|o^. 

Our algorithm does not require an evaluation of Sq^. in the M-Step, as it is not used in 
the following partial E-Step. But, if we are interested in the observed log-likelihood or the 
maximum likelihood estimator S at convergence, we compute Sq^ (at convergence), use 
it together with -B^^jo^ and S^^i^^ to get S via the transformations (I2.ip as explained in 
step (4). 

The MissPA is computationally more efficient than the traditional EM for missing data: 
one cycle through all patterns {k = 1, . . . , s) takes about the same time than one iteration 
of a usual EM. But our algorithm makes more progress since the information from the 
partial E-Step is utilized immediately to perform the next M-Step. We will demonstrate 
empirically the gain of computational efficiency in Section 14.21 

The new MissPA generalizes the traditional EM in two ways. First, the MissPA al- 
ternates betweeii differ ent complete data spaces in the sense of the SA GE algorithm of 



Fessler and Herd (j 19941 ). Secondly, the E-Step is performed incrementally (jNeal and Hintonl . 
19981;). 

There has been no framework for both of these generalizations. In Section [5.21 we will in- 
troduce an appropriate framework for proving numerical convergence to stationary points 
of the observed log-likelihood £(I1; Xobs)- 

Remark 3.1. A slight modification of the MissPA, namely replacing everywhere in the M- 
Step T^ by T res ults in an alterna t ive al gorithm which can be viewed as an incremental 
EM in the sense o uNeal and Hintort \l998i) . Furthermore, we can adapt the partial E-Step 



using the updating rule T = jT + T , with 7 = 1 — \Zk\/n. Here, it is important to update 
T also with respect to the fully observed observations. Note also, that we only need to save 
T, whereas the MissPA also requires T'^ , for k = 1, . . . ,s. However, the described modified 
algorithm is typically inferior to the MissPA. 

3.3 High-Dimensionality, Lasso Penalty and a Coordinate Descent Ap- 
proximation 

It is clear that in a high-dimensional framework with p ^ n some regularization is neces- 
sary. In the M-Step of the MissPA we solve a multivariate regression p roblem. The mai n 
idea is, in order to regularize, to replace the regressions with a Lasso ( Tibshiranil . Il996l ). 
We give now the details. 



Estimation of iJ^^ig^: The estimation of the multivariate regression coefficients in the 
M-Step of the MissPA can be expressed as \mk\ separate minimization problems of the 
form 

B,io, = arg min -7^-^/3 + '(3%-', J/2, 

where j G ruk- Here, Bj^^^ denotes the jth row vector of the ([mfe[ x |oA,.|)-matrix B^^^g^ 
and represents the regression of variable j against the variables from o^ . 



Consider now the objective function 

-T-o\P + 'Hr-XHn + A||/3||i, (3.6) 

with an additional Lasso penalty. Instead of minimizing (13.6(1 with respect to (3 (for all 
J € rn-jt), it is computationally much more efficient to improve it coordinate- wise only from 
the old parameters (computed in the last cycle through all patterns). For that purpose, 

(r) . (r) 

let B \ be the regression coefficients for pattern k in cycle r and B., its ith row 

vector. In cycle r + 1 we compute B ., by minimizing (|3.6p with respect to each of the 
components of /3, holding the other components fixed at their current value. 

Closed-form updates have the form: 



'l,i 



i?,V = '^_; -, for all leok, (3.7) 

where 



• -B .[, is the /th component of i3 .[ equal to the element (?', /) of matrix B , . 

j\l J" j\ok ^ ^•'^ ' mk\ok 

• Si, the gradient of —Tf^Ji + */37^~o^/3/2 with respect to (3i, which equals 

^^ = -^j,' + E ^4^"' + ^'^0 + E V^S- (3-8) 

V<1 V>1 

{z - A if z > A 
z + X if2;<— A,is the standard soft-thresholding operator. 
if \z\ < A 

In a sparse setup the soft-thresholding update ()3.7p can be evaluated very quickly as / 
varies and often coefficients which a re zero remain z e ro aft er thresholding. See also the 
update idea of iFriedman et al\ ( 20081 ) for efficient computation of 



naive- or covariance 
(13:71) and (ESD. 



Estimation of S„^u^: We update the residual covariance matrix as: 

y^(r+l) ^ /-^-fe _n—k tnir+l) _ -n{r+l) ^-k i o(''+l) 7— ^ tnC'^+l) ^ /Ixci |"? q^ 



Formula (|3.9p can be viewed as a generalized version of Equation (j3.5p , when multiplying 
out the matrix product in (13. 5j) and taking expectations. 



Our regularized algorithm, the MissPALasso, is summarized in Table [TJ Note, that the 
partial E-Step remains the same as for the MissPA of Section 13.21 As we are mainly 
interested in estimating the missing values, we will output the data matrix with missing 
values imputed by the regression coefficients -B^^i^^ (/c = 1, . . . , s) as indicated in step (4) 

8 



of Table [H The MissPALasso provides not only the imputed data matrix X but also 
T, the completed version of the sufficient statistic *XX. The latter can be very useful if 
the MissPALasso is used as a pre-processing step followed by a learning method which 
is expressible in terms of the sufficient statistic. Examples include regularized regression 
(e.g., Lass o), discriminant ana lysis, or estimation of directed acyclic graphs with the PC- 
algorithm (ISpirtes et al\ . 12000) • 



By construction, the regression estimates B^^^^^ are sparse, due to the employed £i- 
penalty, and therefore, imputation of missing values X^^ = X^^ ^B^^\^^ is based on sparse 
regressions. This is in sharp contrast to the MissGLasso approach (see Section [4. Ij) which 
places sparsity on S^^. But this does not imply that regressions of variables in rrik on 
variables in Ok are sparse since the inverse of sub-matrices of a sparse S~^ are not sparse 
in general. The MissPALasso employs another type of sparsity and this seems to be the 
main reason for its better statistical performance than MissGLasso. 

Remark 3.2. In practice, we propose to compute the MissPALasso for a decreasing se- 
quence of values for \, using each solution as a warm start for the next problem with 
smaller A. This pathwise strategy is computationally very attractive and our algorithm 
converges (for each X) after a few cycles. 



Missingness Pattern Alternating Imputation and ^i-Penalty (MissPALasso) 

(1) Set r = and start with initial guess for T, T*^ and B , (k = 1, . . . ,s). 

(2) In cycle r + 1; for /c = 1, . . . , s do: 

M-Step: 

For ah j £mk, compute s][+^) by improving -7^~^^/3 + '^To'^oJ/^ + MMi 

(r) 

in a coordinate- wise manner from B\/ . 

c„f yir+^) ^ (--r-k _q--k tr>{i~+l) _ nCf+l) -T--A: i d(^'+1) 7--fc tTjir+'i-)\ /\jc\ 
niklok \^'"ife,"ife ''mk,Ok mk\ok mk\ok "k^^k''' mk\ok Ok>Ok -^mklok J ' ' k\ 

Partial E-Step: 

Setr'= = E[r'=|xl,i?(^"\'\s('^Y^], 
Update r = r-'= + r'=. 

Increase: r -^ r + 1. 

(3) Repeat step (2) until some convergence criterion is met. 

(4) Output the imputed data matrix X, with missing values estimated by: 
"X"'^ — "v'^ *R , h — ^ a 



Table 1: MissPALasso. In the /cth M-Step of cycle r -|- 1, instead of a multivariate Lasso 
regression, a coordinate descent approximation of the corresponding Lasso problem is 
performed. 



4 Numerical Experiments 

4.1 Performance of the MissPALasso 

In this section we will explore the performance of the MissPALasso developed in Sec- 
tion 13. 3i We compare our new method with alternative ways of imputing missing values 
in high-dimensional data. 

We consider the following methods: 

• Knnlmpute: Impute the missing values by th e K-nearest neighbors imputation 
method introduced by iTroyanskaya et al\ (J200ll ) . 



• 



Softlmpute: The soft imputation algorithm is proposed bv lMazumder et al\ ( 20091 ) 
in order to solve the matrix completion problem. They propose to approximate the 
incomplete data matrix X by a complete (low-rank) matrix Z minimizing 

— 2_^ {zij — Xij) + A||Z||*. 

Here, O denotes the indices of observed entries and ||Z||* is the nuclear norm, or the 
sum of the singular values. The missing values of X are imputed by the corresponding 
values of Z. 

MissGLasso: Compute S by minimizing —£{Ti]^obs) + -^11^ ""^llij where HS""*^!!! is 
the entrywise £i-norm. Then, use this estimate to impute th e missing values by con 



ditional mean imputation. The MissGLasso is described in IStadler and Biihlmann 
(l20ld ). 



• MissPALasso: This is the method introduced in Section 

All methods involve one tuning parameter. In Knnlmpute we have to choose the number 
K of nearest neighbors, while Softlmpute, MissGLasso and MissPALasso involve a regu- 
larization parameter which is always denoted by A. In all of our experiments we choose 
the tuning parameters to obtain optimal performance for imputation. 

To assess the perf ormances of the methods we use the normalized root mean squared error 
( Oba et al.l . I2OO3I ) which is defined by 



NRMSE 



mean f (Xti'^t! _ x)2 
\ var (Xt™c) 



Here, X*''"'^ is the original data matrix (before deleting values) and X is the imputed 
matrix. With mean and var we abbreviate the empirical mean and variance, calculated 
over only the missing entries. 

4.1.1 Simulation Study 

We consider the MVN model ~ 7Vp(0, S) with 

10 



50; S: block diagonal with 25 blocks of the form ( Q^g ^f)- 
S : two blocks Bi, B2 each of size 50 x 50 with Bi 



Model 1: p 

Model 2: p = 100 

(B2)jj' = 0.9lJ-j'l. 

Model 3: p = 55; S : block diagonal with 6=1,. 
size 6x5, with (B(,)jj/ = 0.9 (j / j') and (Bb)j,j 

Model 4: p = 500; S^j, = O.gl^'^^'l for j,j' = 1, . . 



I50 and 



10 (increasing) blocks Bf, of the 



,P- 



For all four settings we perform 50 independent simulation runs. In each run we generate 
n = 50 i.i.d samples from the model. We then delete randomly 10%, 20% and 30% of the 
values in the data matrix, apply an imputation method and compute the NRMSE. The 
results of the different imputation methods (optimally tuned) are reported in Table [2l 
MissPALasso is very competitive in all setups. Softlmpute works rather poorly, perhaps 
because the resulting data matrices are not well approximable by low-rank matrices. Kn- 
nlmpute works very well in model 1 and model 4. Model 1, where each variable is highly 
correlated with its neighboring variable, represents an example which fits well into the 
Knnlmpute framework. However, in model 2 and model 3, Knnlmpute performs rather 
poorly. The reason is that with an inhomogeneous covariance matrix, as in model 2 and 
3, the optimal number of nearest neighbors is varying among the different blocks, and a 
single parameter K is too restrictive. For example in model 2, variable four is correlated 
to no other variable, whereas variable twenty-eight is correlated to several other variables. 
The MissGLasso works well in model 2 and model 3, but significantly worse than the 
MissPALasso in model 1 and model 4. Arguably, we consider here only multivariate nor- 
mal models which are ideal, from a distributional point of view, for MissGLasso and our 
MissPALasso. The more interesting case will be with real data (all from genomics) where 
model assumptions never hold exactly. 





Knnlmpute 


Softlmpute 


MissGLasso 


MissPALasso 


Model 1 10% 
20% 
30% 


0.5228 (0.0051) 
0.5948 (0.0052) 
0.6643 (0.0046) 


0.7447 (0.0038) 
0.8149 (0.0033) 
0.8810 (0.0028) 


0.5866 (0.0057) 
0.6822 (0.0046) 
0.7767 (0.0041) 


0.5389 (0.0055) 
0.6152 (0.0044) 
0.6883 (0.0041) 


Model 2 10% 
20% 
30% 


0.8502 (0.0049) 
0.8613 (0.0036) 
0.8717 (0.0028) 


0.8619 (0.0032) 
0.8745 (0.0025) 
0.8836 (0.0024) 


0.7909 (0.0039) 
0.8042 (0.0027) 
0.8179 (0.0029) 


0.7846 (0.0038) 
0.7956 (0.0030) 
0.8052 (0.0029) 


Model 3 10% 
20% 
30% 


0.4543 (0.0057) 
0.4683 (0.0045) 
0.4937 (0.0037) 


0.4856 (0.0042) 
0.5093 (0.0032) 
0.5447 (0.0035) 


0.4069 (0.0047) 
0.4194 (0.0033) 
0.4451 (0.0036) 


0.4136 (0.0049) 
0.4204 (0.0034) 
0.4388 (0.0034) 


Model 4 10% 
20% 
30% 


0.3721 (0.0014) 
0.4139 (0.0011) 
0.4485 (0.0011) 


0.7933 (0.0015) 
0.8157 (0.0013) 
0.8394 (0.0011) 


0.4211 (0.0012) 
0.4548 (0.0012) 
0.4990 (0.0012) 


0.3823 (0.0012) 
0.4038 (0.0011) 
0.4318 (0.0012) 



Table 2: Average (SE) NRMSE of Knnlmpute, Softlmpute, MissGLasso and MissPALasso 
with different degrees of missingness. 
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4.1.2 Real Data Examples 



We consider the following four publicly available datasets: 

• Isoprenoid gene network in Arabidopsis thaliana: The number of genes in the net- 
work isp = 39. The number of observations (gene expression profiles), corresponding 
to differ ent experimental co nditions, is n = 118. More details about the data can be 
found in lWille et all jiooi)- 



Colon cancer: In this dataset, expression levels of 40 tumor and 22 normal colon 
tissues (n = 62) f or p = 2000 human genes are measured. For more information see 
Alon et ad (I1999I ). 

Lymphoma: This dataset, presented in lAlizadeh et all (J200d ). contains gene expres- 
sion levels of 42 samples of diffuse large B-cell lymphoma, 9 observations of follicular 
lymphoma, and 11 cases of chronic lymphocytic leukemia. The total sample size is 
n = 62, and the expression of p = 4026 well-measured genes is documented. 



• Yeast cell-cycle: The dataset, described in lSpellman et al\ (jl998l ). monitors expres- 
sions of p = 6178 genes. The data consists of four parts, which are relevant to alpha 
factor (18 samples), elutriation (14 samples), cdclS (24 samples), and cdc28 (17 
samples). The total sample size is n = 73. 

To keep the computational effort reasonable we use for datasets 2-4 only the 100 variables 
(genes) exhibiting the highest empirical variances. For all datasets we standardize the 
columns (genes) to zero mean and variance of one. The isoprenoid and the colon cancer 
dataset contain no missing values. In the reduced lymphoma dataset 5.63% and in the 
reduced yeast cell-cycle dataset 5.52% of the values are missing. In order to compare the 
performance of the different imputation methods we delete additional values (randomly) 
to obtain an overall missing rate of 10%, 20% and 30%. Table [3] shows the results for 50 
simulation runs, where in each run another random set of values is deleted. 





Knnlmpute 


Softlmpute 


MissGLasso 


MissPALasso 


Arabidopsis 10% 
20% 
30% 


0.7723 (0.0073) 
0.8100 (0.0039) 
0.8354 (0.0039) 


0.7222 (0.0052) 
0.7440 (0.0035) 
0.7683 (0.0023) 


0.7237 (0.0064) 
0.7521 (0.0045) 
0.7741 (0.0025) 


0.7178 (0.0062) 
0.7469 (0.0045) 
0.7691 (0.0024) 


Colon cancer 10% 

20% 
30% 


0.6371 (0.0034) 
0.6519 (0.0027) 
0.6683 (0.0019) 


0.5703 (0.0024) 
0.5930 (0.0022) 
0.6183 (0.0015) 


0.5674 (0.0031) 
0.5883 (0.0028) 
0.6091 (0.0020) 


0.5634 (0.0026) 
0.5842 (0.0025) 
0.6057 (0.0018) 


Lymphoma 10% 
20% 
30% 


0.5090 (0.0052) 
0.5393 (0.0035) 
0.5753 (0.0025) 


0.4936 (0.0039) 
0.5248 (0.0023) 
0.5615 (0.0017) 


0.4127 (0.0047) 
0.4396 (0.0033) 
0.4746 (0.0023) 


0.4034 (0.0045) 
0.4214 (0.0030) 
0.4465 (0.0021) 


Yeast cell-cycle 10% 
20% 
30% 


0.5902 (0.0053) 
0.6160 (0.0025) 
0.6473 (0.0021) 


0.4872 (0.0041) 
0.5439 (0.0021) 
0.5592 (0.0017) 


0.4628 (0.0046) 
0.5035 (0.0026) 
0.5317 (0.0021) 


0.4541 (0.0050) 
0.4969 (0.0028) 
0.5272 (0.0022) 



Table 3: Average (SE) NRMSE of Knnlmpute, Softlmpute, MissGLasso and MissPALasso 
for different real datasets from genomics. 
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MissPALasso exhibits in all setups the lowest averaged normalized root mean squared er- 
ror. The MissGLasso performs nearly as well as the MissPALasso. Softlmpute works well 
for the arabidopsis and colon cancer datasets but significantly worse for the remaining two 
datasets. Interestingly, Knnlmpute performs for all datasets much inferior to the Miss- 
PALasso or the MissGLasso. In light of the simulation results of Section 14.1.11 a reason 
for the poor performance could be that Knnlmpute has difficulties with the inhomoge- 
neous correlation structure between different genes which is plausible to be present in real 
datasets. 

The lymphoma and yeast cell-cycle datasets have "real" missing values. From the left 
panel of Figures [T] and [2] we can read off how many values are missing for each of the 100 
variables. In the right panel of Figures 1 and 2 we show how well the different methods 
are able to estimate 2%, 4%, 6% . . . , 16% of artificially deleted entries. 

4.2 Computational EfRciency 



We compare the MissPA with the standard EM described for example in ISchafeiJ (|1997| ). 
The reason why our algorithm takes less time to converge is because of the more frequent 
updating of the latent distribution in the M-Steps. A key attribute of the MissPA is 
that the computational cost of one cycle through all patterns is the same as the cost of a 
single E-Step of the stand ard EM. This is a big contrast to the incremen tal EM, mostly 
applied to finite mixtures ( Thiesson et al.l . l200ll : lNg and McLachlanl . l2003l ) . where there is 



a trade-off between t he ad ditional computation time per cycle, or "scan" in the language of 
Ng and McLachlanl ( 20031 ) . and the fewer number of "scans" required because of the more 



frequent updating after each partial E-Step. The speed of convergence of the standard 
EM and the MissPA for three datasets are shown in Figure [3l in which the log-likelihood 
is plotted as a function of the number of iterations (cycles). The left panel corresponds to 
the subset of the lymphoma dataset when only the ten genes with highest missing rate are 
used. This results in a 62 x 10 data matrix with 22.85% missing values. For the middle 
panel we draw a random sample of size 62 x 10 from A/io(0, S), Sj j/ = Q.S)\^~^ \ and delete 
the same entries which are missing in the reduced lymphoma data. For the right panel 
we draw from the multivariate t-model with degrees of freedom equal to one and again 
with the same values deleted. As can be seen, the MissPA converges after fewer cycles. A 
very extreme example is obtained with the multivariate t-model where the standard EM 
reaches the log-likelihood level of the MissPA about 400 iterations later. We note here, 
that the result in the right panel highly depends on the realized random sample. With 
other realizations, we get less and more extreme results than the one shown in Figure [3l 

We end this section by illustrating the computational timings of the MissPALasso imple- 
mented with the statistical computing language and environment R. We consider model 
4 of Section 14.1.11 with n = 50 and a growing number of variables p ranging from 10 to 
500. For each p we delete 10% of the data, run the MissPALasso on a decreasing grid (on 
the log-scale) of A values with 30 grid points. For a fixed A we stop the algorithm if the 
relative change in imputation satisfies, 
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Figure 1: Lymphoma dataset. Left panel: Barplots which count the number of missing 
values for each of the 100 genes. Right panel: NRMSE for Knnlmpute, Softlmpute, 
MissGLasso and MissPALasso if we introduce additional 2%, 4%, 6%, . . . , 16% missings. 
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Figure 2: Yeast cell-cycle dataset. Left panel: Barplots which count the number of missing 
values for each of the 100 genes. Right panel: NRMSE for Knnlmpute, Softlmpute, 
MissGLasso and MissPALasso if we introduce additional 2%, 4%, 6%, . . . , 16% missings. 
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Figure 3: Log-likelihood as a function of the number of iterations (cycles) for the standard 
EM and the MissPA. Left panel: subset of the lymphoma data (n = 62, p = 10 and 22.85% 
missing values). Middle panel: random sample of the size 62 x 10 from the multivariate 
normal model with the same missing entries as in the reduced lymphoma data. Right 
panel: random sample of the size 62 x 10 from the multivariate t-model again with the 
same missing values. 

In Figure H] the CPU time in seconds is plotted for various values of p. As shown, we are 
typically able to solve a problem of size p = 100 in about 5 seconds and a problem of size 
p = 500 in about 100 seconds. 



5 Convergence Theory for the MissPA algorithm 



We will now discuss the numerical properties of the MissPA from Section 13.21 As already 
mentioned, the pattern alternating maximization algorithm extends the EM in two ways. 
First, for each pattern a different complete data space is used. Namely, say for pattern k, 
only those samples are augmented which do not belong to pattern k. Secondly, the E-Step 
is performed only on those samples belonging to a single pattern. 



Fessler and Herd ( 19941 ) introduce the SAGE algorithm which generalizes the traditional 
EM in the sense that it allows different data augmentation schemes. They show mono- 
tonicity of the log-likelihood. Nevertheless, the SAGE methodology can not be used here, 
as generalizations of the E-Step are not allowed in their framework. 

On the other hand, iNeal and HintonI ( 19981 ) give a framework which justifies variants of 
the EM that allow generalizing the E-Step. One example is the incremental EM, where 
in each E-Step the expectation is only for one (or a few) samples recalculated before re- 
estimating the parameters of interest in the following M-Step. The price to pay for such 
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Figure 4: CPU times (in seconds) vs. problem size p of the MissPALasso applied on a 
grid of thirty A values. Sample size is n = 50. 

a general ization is that monotonic i ty of the log-likelihood is not guaranteed anymore. 
However, iGunawardana and Bvrnd (120051) show that such an al gorithm still converges to 
a stat ionary point of the likelihood. iNeal and HintonI ( 19981 ) and IGunawardana and Byrne 
(J2005l ) assume a fixed complete data space in their theory which does not apply to our 
MissPA. 

As outlined, neither the SAGE methodology nor the framework from the incremental EM 
are suitable to our proposed algorithm. In Section [52] we provide a new framework and ar- 
guments which justify alternating between complete data spaces and partially performing 
the E-Step. The MissPA will not necessarily increase the log-likelihood in each EM-Step. 
However, in Section 15.31 we prove convergence to a stationary point of the observed log- 
likelihood. 

Before that, we discuss the convergence properties of two special cases. In a situation 
with only one missingness pattern (s = 1), explicit maximization is possible as discussed 
in Section 13.11 and therefore, MissPA finds the optimal solution after a single M-Step. 
Another special case occurs with s = 2. With two patterns it is easy to recognize that the 
E-Step is performed on all augmented data. Thus, the E-Step has not anymore a partial 
character and the MissPA is a special case of a SAGE algorithm for which monotonicity 
of the log-likelihood is guaranteed. 



5.1 Pattern-Depending Lower Bounds 

In order to simplify the notation we assume from now on that all observations have missing 
values, i.e., for every i G {1, . . . , n} there is a pattern k such that i £ Ik- Denote the density 
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Ps(Xfc)ofX^A;€{l,...,s}, by 

Similarly, we define for /c, / e {1, . . . , s} 

Pe(X^J = JJp(xi,Ofe;SoJ and 

ieXi 

Set {Si}/^fc = (Si, . . . , Sfc-i, Sfc+i, . . . , Ss) and consider for /c = 1, . . . , s 

^fc[Sfc|l{Sj,^fc] = logPs,(X,\) + ^(EsJlogPs,(X')|Xij + ?^,pd)- 

Here "H/p] = — Ej^[logPj;,(Xj„jXQj|XQj denotes the entropy. Note that Fk is defined 
w.r.t fixed observed data Xobs- The subscript k highlights the dependence on the pattern 
k. Furthermore, for fixed Xobs and fixed k, J-^ is a function in the parameters (Si, . . . , S^). 

As a further tool we write the Kullback-Leibler divergence in the following form: 

Pz[S||S]=E^[-log(Ps(X^jXij/P^(X^jXi^))|Xij. (5.10) 

An important property of the Kullback-Leibler divergence is its non-negativity: 

2?/[S||S] > 0, with equality if and only if 

Ps(XmjXoJ = Pe(X^JXoJ. 

A simple calculation shows that 

E^[logPs(X')|Xij +?^,[S] = -P,[S|1S] +logPE(Xi^). (5.11) 

Using Equation (|5.1ip . Fk\^k\\{^i}ii=k\ takes the convenient form 

Jfc[Efcl|{Sj,^fc] = £(Sfc;Xobs)-5^^/[Sd|Sfc]. (5.12) 

In particular, for fixed values of {Tji}i^k-, Fk[- \\{Tji}i^k\ lower bounds the observed log- 
likelihood l{ ■ ; Xobs) due to the non-negativity of the Kullback-Leibler divergence. 

5.2 MissPA: Optimization Transfer to Pattern-Depending Lower Bounds 

We give now an alternative description of the MissPA algorithm. In cycle r + 1 through 
all patterns, generate (S^"*" , . . . , Sg+-^) given {11\, . . . , YZ.) according to 

S^^+i = argmax7-fc[S|Z^^+i], /c = l,...,s, (5.13) 



s 
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We have 



Kk 
l>k 



The entropy terms do not depend on the optimization parameter S, therefore, 

J-fc[E||Z^,+i] = const + logPs(X^J + 5]E2.+i[logPs(XO|Xij + J^Esj-[logPs(X')|Xij. 

Kk l>k 

Using the factorization log Ps(X') = logP(X;,^,; SoJ+logP(X|„jX^^; 5^^l„^, S„^I„J (for 
all / / k), and separate maximization w.r.t Sq^. and {B^^\o^,Tj^^\f,^) we end up with the 
expressions from the M-Step of the MissPA. Summarizing the above, we have recovered 
the M-Step as a maximization of 7-"^ [ S 1 1 Z^^"*" ] whi c h is a lower bound of the observed 
log-likelihood. Or in the language of iLange et al\ (J2000l ). optimization of £(-;Xobs) is 



transferred to the surrogate objective Fk[- ||Z^^ ]. 

There is still an important piece missing: In M-Step k of cycle r -|- 1 we are maximizing 
Tk[ ■ IIZ^"^ ] whereas in the following M-Step {k + 1) we optimize J^k+i[ ■ W^kXi^' ^^ order 
for the algorithm to make progress, it is essential that Tk+i[ ■ ||Z^i]^] attains higher values 

k 



than its predecessor Tf^[ ■ ||ZI^ ]. In this sense the following Proposition is crucial 



Proposition 5.1. For r = 0, 1, 2, . . . we have that 

/•,[S^||Z',]<7-i[S^||Z^J+i], and 

J-,[E^/i||Z^,+i] < 7-,+i[S^.+il|Z^i] fork = l,...,s-l. 

Proof. We have, 

7-,[S^+il|Z^^+i] = logP^.+i(X^J +Es.;^JlogP^.+i(X'=+i)|X,\+^\] +?^,+i[Sl+i] +rest 
and 

J-fe+i[S^,+i||Z^^+l] = log P^.+i (X^-+\ ) +E^j.+i [log P^.+i(X^)|X^J +?^,[S^/i] + rest 
where 



rest = Y^ (E^.+i[logP2.+i(X')|Xij+?^,[S[+i])+ J^ (Es^[logPs.+i(X')|Xij+?^aS[]). 

Kk l>k+l 

Furthermore, using (15. IIP and noting that I?fc[S^"*' HS]^"*" ] = 0, we obtain 
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•^fcpriizn -•^^+ipriiz^:i] =^;^prii5^ri -^^+iK+ipri 

Note that equality holds if and only if P^.+i(X^+^jX^+\) = Ps^^^(X^^+^jX^+\). D 

In light of Proposition lS.ll it is clear that (j5.13p generates a monotonely increasing sequence 
of the form: 

J-,[eO||zO] < 7-i[sO||Zi] < -FipllJZl] < F2[T.\\\7.l] < J-2p^||Z^] < ■ ■ ■ 

■ ■ • < M^l-^'W^-"'] < ^k+ii^'j^'lKXl] < ^k+i[KX\\\^'kt\] <■■■ 

For example, we can deduce that {75[S5||Z5]},r=o,i,2,... is a monotone increasing sequence 
in r. 



5.3 Convergence to Stationary Points 

Ideally we would like to show that a limit point of the sequence generated by the MissPA 
algorithm is a global maximum of i{T,; Xobs)- Unfortunately, this is too ambitious because 
for general missing data patterns the observed log-likelihood is a non-concave function with 
several local maxima. Thus, the most we can expect is that our algorithm converges to 
a stationary point. This is ensured by the following theorem which is proved in the Ap- 
pendix. 

Theorem 5.1. Assume that /C = {(Si,...,^^) : J'^p^USi, . . . , Ss_i] > J'4S°||Z°]} is 
compact. Then every limit point Tig o/ {$]^}r=o,i,2,... is a stationary point of i{ ■ ;Xobs)- 



6 Conclusions 

We have presented a new algorithm, the MissPA, for maximizing the observed log-likelihood 
for a multivariate normal data matrix with missing values. Simplified, our algorithm itera- 
tively cycles through the different missingness patterns, performs multivariate regressions 
of the missing on the observed variables and uses these regression coefficients for partial 
imputation of the missing values. We argued theoretically and gave numerical examples 
showing that the MissPA is computationally more efficient than the standard EM algo- 
rithm. Furthermore, we analyze the numerical properties using non-standard arguments 
and prove that solutions of the MissPA converge to stationary points of the observed 
log-likelihood. 

In a high-dimensional setup with p ^ n the regression interpretation of the MissPA 
opens up the door to do regularization by replacing least squares regressions with Lasso 
analogues. Our proposed algorithm, the MissPALasso, performs a coordinate descent ap- 
proximation of the corresponding Lasso problem in order to gain speed. On simulated 
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and four real datasets (all from genomics) we demonstrate that the MissPALasso outper- 
forms alternative imputation techniques such as k-nearest neighbors imputation, nuclear 
norm minimization or a penalized likelihood approach with an ^i-penalty on the inverse 
covariance matrix. 



A Proof of Theorem 15.11 

Proof. First, note that the sequence {(E^, . . . , S^)}r=o,i,2,... lies in the compact set /C. 
Now, let T,s^ be a subsequence converging to S^ as j -^ oo. By invoking compactness, we 
can assume w.l.o.g (by restricting to a subsequence) that {T,^^ , • • • , S/) -^ (Si, . . . , Eg)- 

As a direct consequence of the monotonicity of the sequence {^s[5^sl|Z^]}r=o,i,2,... we obtain 

lim j;[s:i|z^] = j;[s,||Ei, . . . , s,_i] = T. 

r 

From (|5.13p and Proposition 15.11 for k = 1, . . . , s — 1 and r = 0, 1, 2, ... , the following 
"sandwich" -formulae hold: 



r+lllrvr+li ^ 77, rv^'+l I IV'+ll <^ T TV^'+l I ly'+ll 



J-,[S^||Z^] < -Fi[S:|[Z^+i] < J-i[S^+i||Z^+i] < -F4S:+1|IZ, J, 



As a consequence we have for k = 1, . . . ,s — 1 

lim J-iKJlZr+i] = lim J-i[S:+i||Zr+i] = T and (A.14) 

r r 

hm J-fc+i[S^,+i||Z^,+l] = lnnTk+i[^lX\\\ZlX\] = T. (A.15) 

Now consider the sequence (S/ , . . . , S/ ). By compactness of /C this sequence con- 
verges w.l.o.g to some (SJ, . . . , S*). We now show by induction that 

s — ^1 — ■ ■ ■ — ^s- 

From the 1st M-Step of cycle tj + 1 we have 

J-i[S^^+^I|Z^^+^] > J-i[S|lZp+^] foraU S. 

Taking the limit j ^- oo we get: 

J^i[St||{S,h>i]>7-i[S|{Sai>i] for all S. 

In particular, S^ is the (unique) maximizer of J^i[- |[{Si}i>i]. Assuming Yj\ ^ S^ would 
imply 

Ti[T.\\\{ti}iyi]>Ti[tM^i]i>i\- 

But this contradicts J'i[S5;||{Sj,>i] = Ti[t.s\\{%}i>i\ = -F, which holds by ([Alii) . 
Therefore we obtain S^ = S^. 
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Assume that we have proven S* = . . . = S^ = S^. We will show that T,'^,-^ = Tig. From 
the k+lst M-Step in cycle rj + 1: 

J-fc+ip^^^YllO^-^fc+iPllO for all E. 
Taking the limit for j — > oo, we conclude that T,'^,^ is the (unique) maximizer of 

^k+l[- {{{^DlKk+l, {^l}l>k+l]- 

Prom (jA.lSp . 

J^k+l[^*k+l\\{^Ul<k+l,{^l}l>k+l] = ^k+l[^k\\{^l}l<k+l, {'^l}l>k+l] = J^, 



and therefore S^ , -^ must be equal to S^. By induction we have S^ = S^ and so we proved 
that S*_^^ = Es holds. 

Finally, we show stationarity of S^. Invoking (|5.12p we can write 



1=1 

Note that 

8 - _ 

0. 



d 



s. 



Furthermore, as S/ maximizes J-'s[S[|S-^^ , ■ ■ ■ , ^s-i ]' ^^ S^t in the limit as j ^ oo 



^J-4S|E„...,S,] 



_ =^J-.[S||Et,...,E:_i] 



= 0. 



Therefore, we conclude that ^£(S;Xobs; 



= 0. D 

2« 
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